Install pandas by running one of the following commands at the command prompt (or the Anaconda prompt):
pip install pandas
or
conda install pandas
Once pandas is installed, import it into your application with the import keyword. The version string is stored in the __version__ attribute:
import pandas as pd
print(pd.__version__)
1.4.4
mydict = {
'student': ["Arnab", "Mainak", "Kunal","Sourav"],
'Age': [24,25,25,26]
}
type(mydict)
dict
mydat= pd.DataFrame(mydict)
mydat
| | student | Age |
|---|---|---|
| 0 | Arnab | 24 |
| 1 | Mainak | 25 |
| 2 | Kunal | 25 |
| 3 | Sourav | 26 |
type(mydat)
pandas.core.frame.DataFrame
Similarly, using NumPy we can also create a DataFrame, like this:
import numpy as np
# Here, np.arange(0,20).reshape(5,4) is creating a 2D array i.e., a matrix with dim (5x4).
df= pd.DataFrame(np.arange(0,20).reshape(5,4),index=None,columns=["Colm1","Colm2","Colm3","Colm4"])
print(df)
type(df)
   Colm1  Colm2  Colm3  Colm4
0      0      1      2      3
1      4      5      6      7
2      8      9     10     11
3     12     13     14     15
4     16     17     18     19
pandas.core.frame.DataFrame
A Pandas Series is like a column in a table.
It is a one-dimensional array holding data of any type.
a=[1,2,3,4,5]
s1= pd.Series(a)
s1
0    1
1    2
2    3
3    4
4    5
dtype: int64
type(s1)
pandas.core.series.Series
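A Series also carries an index; by default it runs from 0 to n-1, as above, but we can supply our own labels and then access values by label. A quick sketch (the letter labels here are made up for illustration):

s2 = pd.Series([10, 20, 30], index=["a", "b", "c"]) ## custom index labels
s2["b"] ## access by label -> 20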
There are two ways of accessing the elements of a DataFrame:

1. loc: returns one or more specified row(s), selected by index label.
2. iloc: returns one or more specified row(s) and column(s), selected by integer position.

(A short sketch contrasting the two on a non-default index follows the examples below.)

df.loc[0] ## 1st row of the df
Colm1    0
Colm2    1
Colm3    2
Colm4    3
Name: 0, dtype: int32
type(df.loc[0])
pandas.core.series.Series
df.loc[[0,1]]
| | Colm1 | Colm2 | Colm3 | Colm4 |
|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 | 7 |
type(df.loc[[0,1]])
pandas.core.frame.DataFrame
Note: when loc selects a single row, the return type is a Series; when it selects more than one row, the return type is a DataFrame.
df.iloc[0,0] ## (1,1) element of df
0
df.iloc[0:2,0:3] ## first two rows and first three columns
| | Colm1 | Colm2 | Colm3 |
|---|---|---|---|
| 0 | 0 | 1 | 2 |
| 1 | 4 | 5 | 6 |
type(df.iloc[0:2,0:3])
pandas.core.frame.DataFrame
df.iloc[:,0]
0     0
1     4
2     8
3    12
4    16
Name: Colm1, dtype: int32
type(df.iloc[:,0])
pandas.core.series.Series
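The label-versus-position distinction between loc and iloc is invisible on df, whose index happens to be 0, 1, 2, ...; with a non-default index the two behave differently. A quick sketch with made-up letter labels:

df_lab = pd.DataFrame(np.arange(0,20).reshape(5,4),
                      index=["a","b","c","d","e"],
                      columns=["Colm1","Colm2","Colm3","Colm4"])
df_lab.loc["a"] ## row selected by its label "a"
df_lab.iloc[0] ## row selected by its position 0 (the same row here)
## df_lab.loc[0] would raise a KeyError, since 0 is not a label in this index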
Instead of using loc or iloc, we can simply use plain indexing:
df[0:2] # 1st two rows
| | Colm1 | Colm2 | Colm3 | Colm4 |
|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 | 7 |
df["Colm1"]
0     0
1     4
2     8
3    12
4    16
Name: Colm1, dtype: int32
df[["Colm1","Colm2"]]
| | Colm1 | Colm2 |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 4 | 5 |
| 2 | 8 | 9 |
| 3 | 12 | 13 |
| 4 | 16 | 17 |
df.values
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])
type(df.values)
numpy.ndarray
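As a side note, newer pandas versions also provide to_numpy(), which is the now-recommended way to obtain the same NumPy array:

df.to_numpy() ## same values as df.values, returned as a numpy.ndarray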
df["Colm1"].value_counts()
0     1
4     1
8     1
12    1
16    1
Name: Colm1, dtype: int64
type(df["Colm1"].value_counts())
pandas.core.series.Series
df[["Colm1","Colm2"]].value_counts()
Colm1  Colm2
0      1        1
4      5        1
8      9        1
12     13       1
16     17       1
dtype: int64
df2= pd.DataFrame(np.array([[1,2,3,4],
[2,4,7,9],
[6,4,5,7],
[5,5,7,9]]),columns=["colm1","colm2","colm3","colm4"])
df2
| | colm1 | colm2 | colm3 | colm4 |
|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 |
| 1 | 2 | 4 | 7 | 9 |
| 2 | 6 | 4 | 5 | 7 |
| 3 | 5 | 5 | 7 | 9 |
df2[["colm2","colm4"]].value_counts()
colm2  colm4
2      4        1
4      7        1
       9        1
5      9        1
dtype: int64
df.shape
(5, 4)
df2.shape
(4, 4)
df.isnull().sum() ## For NULL values
Colm1    0
Colm2    0
Colm3    0
Colm4    0
dtype: int64
df.isna().sum()
Colm1    0
Colm2    0
Colm3    0
Colm4    0
dtype: int64
df3= pd.DataFrame(np.array([[1,np.nan,3,4],
[2,4,np.nan,9],
[6,4,5,7],
[5,5,7,9]]),columns=["colm1","colm2","colm3","colm4"])
df3
| | colm1 | colm2 | colm3 | colm4 |
|---|---|---|---|---|
| 0 | 1.0 | NaN | 3.0 | 4.0 |
| 1 | 2.0 | 4.0 | NaN | 9.0 |
| 2 | 6.0 | 4.0 | 5.0 | 7.0 |
| 3 | 5.0 | 5.0 | 7.0 | 9.0 |
df3.isna().sum()
colm1    0
colm2    1
colm3    1
colm4    0
dtype: int64
df3.isnull().sum()
colm1    0
colm2    1
colm3    1
colm4    0
dtype: int64
iris=pd.read_csv("D:/Users/User/Downloads/iris.csv")
iris.head(5)
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
iris.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Sepal.Length  150 non-null    float64
 1   Sepal.Width   150 non-null    float64
 2   Petal.Length  150 non-null    float64
 3   Petal.Width   150 non-null    float64
 4   Species       150 non-null    object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
df3.dropna()
| | colm1 | colm2 | colm3 | colm4 |
|---|---|---|---|---|
| 2 | 6.0 | 4.0 | 5.0 | 7.0 |
| 3 | 5.0 | 5.0 | 7.0 | 9.0 |
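Dropping rows is not the only option: we can also fill the missing entries, or drop columns instead of rows. A minimal sketch (the fill value 0 here is chosen purely for illustration):

df3.fillna(0) ## replace every NaN with 0
df3.dropna(axis=1) ## drop the columns containing NaN (colm2 and colm3) instead of rows

Note that both methods return a new DataFrame and leave df3 itself unchanged.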
Up to this point we’ve been focused primarily on one-dimensional and two-dimensional data, stored in Series and DataFrame objects, respectively.
Often it is useful to go beyond this and store higher-dimensional data, that is, data indexed by more than one or two keys.
Example: suppose we would like to track data about states from two different years.
Using the Pandas tools we've already covered, we might be tempted to simply use Python tuples as keys:
index = [('California', 2000), ('California', 2010), ## Data index
('New York', 2000), ('New York', 2010),
('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956, ## Data values
18976457, 19378102,
20851820, 25145561]
pop = pd.Series(populations, index=index)
pop
(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64
But accessing values through this kind of index is not convenient; the lookups get messy:
# Now, if we want to access all data for 2010
pop[[i for i in pop.index if i[1] == 2010]]
(California, 2010)    37253956
(New York, 2010)      19378102
(Texas, 2010)         25145561
dtype: int64
So, instead of doing this, we can use a Pandas MultiIndex.
We can create a multi-index from the tuples as follows:
index = pd.MultiIndex.from_tuples(index)
index
MultiIndex([('California', 2000),
            ('California', 2010),
            ( 'New York', 2000),
            ( 'New York', 2010),
            (     'Texas', 2000),
            (     'Texas', 2010)],
           )
Now, if we reindex our series (i.e., pop) with this MultiIndex, we see the hierarchical representation of the data:
pop= pop.reindex(index)
pop
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
# Now to access all data for which the second index is 2010
pop[:,2010]
California    37253956
New York      19378102
Texas         25145561
dtype: int64
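Partial indexing works the other way round as well: indexing with just the first level gives all years for one state.

pop["California"] ## all entries whose first index level is California
## 2000    33871648
## 2010    37253956
## dtype: int64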
Concatenation of Series and DataFrame objects is very similar to concatenation of NumPy arrays, which can be done via the np.concatenate function; in pandas it is done with pd.concat.
## Concatenating two series
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])
1    A
2    B
3    C
4    D
5    E
6    F
dtype: object
df1 = pd.DataFrame([["A1","B1"],
["A2","B2"]],index=[1,2],columns=["A","B"])
print("df1:\n",df1,"\n")
df2 = pd.DataFrame([["A3","B4"],
["A5","B6"]],index=[3,4],columns=["A","B"])
print("df2:\n",df2,"\n")
print("Concatenated Data:\n",pd.concat([df1, df2]))
df1:
    A   B
1  A1  B1
2  A2  B2

df2:
    A   B
3  A3  B4
4  A5  B6

Concatenated Data:
    A   B
1  A1  B1
2  A2  B2
3  A3  B4
4  A5  B6
df1 = pd.DataFrame([["A1","B1"],
["A2","B2"]],index=[1,2],columns=["A","B"])
print("df1:\n",df1,"\n")
df2 = pd.DataFrame([["C1","D1"],
["C2","D2"]],index=[1,2],columns=["C","D"])
print("df2:\n",df2,"\n")
print("Concatenated Data:\n",pd.concat([df1, df2],axis=1))
df1:
    A   B
1  A1  B1
2  A2  B2

df2:
    C   D
1  C1  D1
2  C2  D2

Concatenated Data:
    A   B   C   D
1  A1  B1  C1  D1
2  A2  B2  C2  D2
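One point worth noting: pd.concat aligns on the index, so along axis=1 any index label present in one frame but missing from the other is filled with NaN. A small sketch, reusing df1 from above together with a hypothetical frame df4 whose index only partially overlaps:

df4 = pd.DataFrame([["C1","D1"]],index=[2],columns=["C","D"]) ## only index label 2
pd.concat([df1, df4], axis=1) ## row 1 gets NaN in columns C and D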
An essential piece of analyzing large data is efficient summarization: computing aggregations like the sum, mean, median, min, max, standard deviation, and full summaries.
Listing of Pandas aggregation methods
| Function | Description |
|---|---|
| count() | Total number of items |
| first(), last() | First and last item |
| mean(), median() | Mean and median |
| min(), max() | Minimum and maximum |
| std(), var() | Standard deviation and variance |
| mad() | Mean absolute deviation |
| prod() | Product of all items |
| sum() | Sum of all items |
| describe() | Summary of a DataFrame or Series |
iris.iloc[:,:4].sum()/iris.shape[0] ## Mean
Sepal.Length    5.843333
Sepal.Width     3.057333
Petal.Length    3.758000
Petal.Width     1.199333
dtype: float64
iris.iloc[:,:4].mean() ## Mean
Sepal.Length    5.843333
Sepal.Width     3.057333
Petal.Length    3.758000
Petal.Width     1.199333
dtype: float64
iris.iloc[:,:4].describe() ## Summary
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
Simple aggregations can give you a flavor of your dataset, but often we would prefer to aggregate conditionally on some label or index: this is implemented in the so-called groupby operation.
A canonical example of this split-apply-combine operation is one where the "apply" step is a summation aggregation, as sketched below.
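In code, the three steps can be illustrated on a toy DataFrame (the key and data columns here are made up for illustration):

toy = pd.DataFrame({"key": ["A", "B", "A", "B"],
                    "data": [1, 2, 3, 4]})
toy.groupby("key").sum() ## split on key, apply sum per group, combine
## key A -> 1+3 = 4, key B -> 2+4 = 6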
iris.groupby("Species").min()
| Species | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|---|
| setosa | 4.3 | 2.3 | 1.0 | 0.1 |
| versicolor | 4.9 | 2.0 | 3.0 | 1.0 |
| virginica | 4.9 | 2.2 | 4.5 | 1.4 |
We can use the describe() method of DataFrames to perform a set of aggregations that describe each group in the data:
iris.groupby("Species")["Sepal.Length"].describe()
| Species | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| setosa | 50.0 | 5.006 | 0.352490 | 4.3 | 4.800 | 5.0 | 5.2 | 5.8 |
| versicolor | 50.0 | 5.936 | 0.516171 | 4.9 | 5.600 | 5.9 | 6.3 | 7.0 |
| virginica | 50.0 | 6.588 | 0.635880 | 4.9 | 6.225 | 6.5 | 6.9 | 7.9 |